action planning
Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving
Hu, Bin, Lu, Zijian, Liao, Haicheng, Yuan, Chengran, Rao, Bin, Li, Yongkang, Li, Guofa, Cui, Zhiyong, Xu, Cheng-zhong, Li, Zhenning
Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Hori, Chiori, Masuyama, Yoshiki, Jain, Siddarth, Corcodel, Radu, Jha, Devesh, Romeres, Diego, Roux, Jonathan Le
Abstract--Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on the robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, the current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach to feed text embeddings directly into the LLM decoder to mitigate the high abstraction of the information in text by Q-former . Experiments with the Y ouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves the confirmation and action planning by integrating VideoLLaMA3.
Simulating Generative Social Agents via Theory-Informed Workflow Design
Yan, Yuwei, Piao, Jinghua, Lan, Xiaochong, Shao, Chenyang, Hui, Pan, Li, Yong
Recent advances in large language models have demonstrated strong reasoning and role-playing capabilities, opening new opportunities for agent-based social simulations. However, most existing agents' implementations are scenario-tailored, without a unified framework to guide the design. This lack of a general social agent limits their ability to generalize across different social contexts and to produce consistent, realistic behaviors. To address this challenge, we propose a theory-informed framework that provides a systematic design process for LLM-based social agents. Our framework is grounded in principles from Social Cognition Theory and introduces three key modules: motivation, action planning, and learning. These modules jointly enable agents to reason about their goals, plan coherent actions, and adapt their behavior over time, leading to more flexible and contextually appropriate responses. Comprehensive experiments demonstrate that our theory-driven agents reproduce realistic human behavior patterns under complex conditions, achieving up to 75% lower deviation from real-world behavioral data across multiple fidelity metrics compared to classical generative baselines. Ablation studies further show that removing motivation, planning, or learning modules increases errors by 1.5 to 3.2 times, confirming their distinct and essential contributions to generating realistic and coherent social behaviors.
A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations
Ren, Zeng, Guan, Xinyi, Rohrmeier, Martin
Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.
In-Context Defense in Computer Agents: An Empirical Study
Yang, Pei, Ci, Hai, Shou, Mike Zheng
Computer agents powered by vision-language models (VLMs) have significantly advanced human-computer interaction, enabling users to perform complex tasks through natural language instructions. However, these agents are vulnerable to context deception attacks, an emerging threat where adversaries embed misleading content into the agent's operational environment, such as a pop-up window containing deceptive instructions. Existing defenses, such as instructing agents to ignore deceptive elements, have proven largely ineffective. As the first systematic study on protecting computer agents, we introduce textbf{in-context defense}, leveraging in-context learning and chain-of-thought (CoT) reasoning to counter such attacks. Our approach involves augmenting the agent's context with a small set of carefully curated exemplars containing both malicious environments and corresponding defensive responses. These exemplars guide the agent to first perform explicit defensive reasoning before action planning, reducing susceptibility to deceptive attacks. Experiments demonstrate the effectiveness of our method, reducing attack success rates by 91.2% on pop-up window attacks, 74.6% on average on environment injection attacks, while achieving 100% successful defenses against distracting advertisements. Our findings highlight that (1) defensive reasoning must precede action planning for optimal performance, and (2) a minimal number of exemplars (fewer than three) is sufficient to induce an agent's defensive behavior.
Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL
Kanazawa, Naoaki, Kawaharazuka, Kento, Obinata, Yoshiki, Okada, Kei, Inaba, Masayuki
Although there is a growing demand for cooking behaviours as one of the expected tasks for robots, a series of cooking behaviours based on new recipe descriptions by robots in the real world has not yet been realised. In this study, we propose a robot system that integrates real-world executable robot cooking behaviour planning using the Large Language Model (LLM) and classical planning of PDDL descriptions, and food ingredient state recognition learning from a small number of data using the Vision-Language model (VLM). We succeeded in experiments in which PR2, a dual-armed wheeled robot, performed cooking from arranged new recipes in a real-world environment, and confirmed the effectiveness of the proposed system.
SurveyAgent: A Conversational System for Personalized and Efficient Research Survey
Wang, Xintao, Chen, Jiangjie, Li, Nianqi, Chen, Lida, Yuan, Xinfeng, Shi, Wei, Ge, Xuyang, Xu, Rui, Xiao, Yanghua
In the rapidly advancing research fields such as AI, managing and staying abreast of the latest scientific literature has become a significant challenge for researchers. Although previous efforts have leveraged AI to assist with literature searches, paper recommendations, and question-answering, a comprehensive support system that addresses the holistic needs of researchers has been lacking. This paper introduces SurveyAgent, a novel conversational system designed to provide personalized and efficient research survey assistance to researchers. SurveyAgent integrates three key modules: Knowledge Management for organizing papers, Recommendation for discovering relevant literature, and Query Answering for engaging with content on a deeper level. This system stands out by offering a unified platform that supports researchers through various stages of their literature review process, facilitated by a conversational interface that prioritizes user interaction and personalization. Our evaluation demonstrates SurveyAgent's effectiveness in streamlining research activities, showcasing its capability to facilitate how researchers interact with scientific literature.
Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models
Fei, Hao, Wu, Shengqiong, Ji, Wei, Zhang, Hanwang, Chua, Tat-Seng
Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than the past approaches. While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to the intricate temporal dynamics modeling, one of the crux of video synthesis. In this work, we investigate strengthening the awareness of video dynamics for DMs, for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed as Dysen) module, which includes (step-1) extracting from input text the key actions with proper timeorder arrangement, (step-2) transforming the action schedules into the dynamic scene graph (DSG) representations, and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features, integrated into the backbone T2V DM for video generating. Experiments on popular T2V datasets suggest that our framework consistently outperforms prior arts with significant margins, especially in the scenario with complex actions. Recently, the theme of AI-Generated Content (AIGC) has witnessed thrilling advancements and remarkable progress, e.g., ChatGPT (Ouyang et al., 2022), DELLE-2 (Ramesh et al., 2022) and Stable Diffusion (SD) (Rombach et al., 2022b). As one of the generative topics, text-to-video synthesis that generates video content complying with the provided textual description has received an increasing number of attention in the community. More recently, diffusion models have emerged to provide a new paradigm of T2V. Compared with previous models, DMs advance in superior generation quality and scaling capability to large datasets (Harvey et al., 2022; Höppe et al., 2022), and thus showing great potential on this track (Mei & Patel, 2022; Luo et al., 2023; Yu et al., 2023; Ni et al., 2023). While the latest DMbased T2V explorations paid much effort into enhancing the quality of video frames, i.e., generating The dynamic scene manager (Dysen) module operates over the input text prompt and produces the enriched dynamic scene graph (DSG), which is encoded by the recurrent graph Transformer (RGTrm), and the resulting fine-grained spatio-temporal scene features are integrated into the video generation (denoising) process.
Ensemble Latent Space Roadmap for Improved Robustness in Visual Action Planning
Lippi, Martina, Welle, Michael C., Gasparri, Andrea, Kragic, Danica
Abstract-- Planning in learned latent spaces helps to decrease the dimensionality of raw observations. In this work, we propose to leverage the ensemble paradigm to enhance the robustness of latent planning systems. We rely on our Latent Space Roadmap (LSR) framework, which builds a graph in a learned structured latent space to perform planning. Given multiple LSR framework instances, that differ either on their latent spaces or on the parameters for constructing the graph, we use the action information as well as the embedded nodes of the produced plans to define similarity measures. These are then utilized to select the most promising plans.
Optimal task and motion planning and execution for human-robot multi-agent systems in dynamic environments
Faroni, Marco, Umbrico, Alessandro, Beschi, Manuel, Orlandini, Andrea, Cesta, Amedeo, Pedrocchi, Nicola
Combining symbolic and geometric reasoning in multi-agent systems is a challenging task that involves planning, scheduling, and synchronization problems. Existing works overlooked the variability of task duration and geometric feasibility that is intrinsic to these systems because of the interaction between agents and the environment. We propose a combined task and motion planning approach to optimize sequencing, assignment, and execution of tasks under temporal and spatial variability. The framework relies on decoupling tasks and actions, where an action is one possible geometric realization of a symbolic task. At the task level, timeline-based planning deals with temporal constraints, duration variability, and synergic assignment of tasks. At the action level, online motion planning plans for the actual movements dealing with environmental changes. We demonstrate the approach effectiveness in a collaborative manufacturing scenario, in which a robotic arm and a human worker shall assemble a mosaic in the shortest time possible. Compared with existing works, our approach applies to a broader range of applications and reduces the execution time of the process.